Magnificent Seven ETF Analysis¶

Spring 2024 Data Science Project¶

By: Dhruv Dewan, Anish Nandyala, Aditya Prashanth¶

Member 1: Dhruv Dewan, Contribution: 90% (did not contribute to G). 

Member 2: Anish Nandyala, Contribution: 90% (did not contribute to B). 

Member 3: Aditya Prashanth, Contribution: 90% (did not contribute to C).

"We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial."

Dhruv Dewan, Anish Nandyala, Aditya Prashanth - 05/07/2024

Dhruv Dewan - I worked together with Anish and Aditya on most sections of this final tutorial. Specific places where my contribution is more critical is in the creation of the LSTM model and it's layers, the creation and visualization of the stocks and technical indicators, and several of the financially backed explanations.

Anish Nandyala - I worked on all the sections but the data curation section for this final tutorial. I especially spent the most time in the data exploratory analysis section of the tutorial, and brainstorming the hypothesis tests and correlation tests to use. I also contributed to the visualization of the ML model for the loss plots.

Aditya Prashanth - I worked on a variety of sections across the proposal. My biggest contributions were in determining the features and creating the training and testing datasets used for the model, creating an initial regression model, and writing the insights and conclusion for the project.

Introduction¶

In today's fast-paced financial environment, navigating the stock market's continuously changing waves can feel like high-stakes gambling. With fortunes climbing and falling in a split second, investors constantly seek the keys to predictability and profitability. The concept of Exchange-Traded Funds (ETFs) has provided a diversified avenue for investment, offering exposure to a basket of assets within a single fund. Among these, the Magnificent Seven ETF (from Roundhill Investments) stands out, comprised of seven tech giants: Apple, Amazon, Meta (Facebook), Alphabet (Google), Tesla, Nvidia, and Microsoft. It is well known that these tech stocks and other tech stocks in general are very volatile and prone to great change. With the risk that this volatility brings, it accompanies great potential for profit as well. That is what our motivation for this project stems from.

Our project aims to utilize the power of machine learning to discover the hidden features of stock prediction. At its core, we aim to address a fundamental question: Can we leverage historical stock data, technical indicators, and other features to forecast the future performance of the Magnificent Seven ETF with machine learning?

In an industry where volatility rules the market, the ability to anticipate price movements with a degree of accuracy holds great value. Successful predictions empower investors to make informed decisions, mitigate risks, and capitalize on opportunities for profit. Furthermore, in the realm of ETFs, where the fortunes of multiple companies are intertwined, the stakes are even higher, and the potential rewards even more enticing. To achieve our goal, we will dive into the realm of technical analysis, a cornerstone of financial forecasting. By identifying and utilizing key technical indicators, widely recognized within the world of trading, we aim to construct a framework for predicting the future performance of the Magnificent Seven ETF. From moving averages to relative strength index (RSI), these indicators offer valuable insights into market trends, momentum, and sentiment, serving as the foundation upon which our predictive models will be built.

In summary, our project not only aims to create our own stock prediction model for the MAG7, but also to demonstrate the potential of machine learning in revolutionizing investment strategies.

In [40]:
# Libraries
import pandas as pd
import numpy as np
import yfinance as yf
import scipy
import warnings
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping


# import plotting tools
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# interactive plot stuff
import plotly.graph_objects as go
import plotly

# technical analysis
import pandas_ta as ta

Data Curation¶

In Python, the yfinance library serves as a valuable tool for accessing financial data from Yahoo Finance. The download function from yfinance is used to retrieve historical market data pertaining to the Magnificent Seven ETF, by its ticker symbol 'MAGS'. The period of time specified is 1 year, because the MAG7 ETF has only existed since last April, and for the purpose of our project it makes sense to have data for a year for clean comparisons. Upon execution, the resulting dataset, representing the Magnificent Seven ETF's historical market activity, is stored in a pandas DataFrame format. We get a succinct preview of the first few rows using .head(), to get insight into the structure and content of the retrieved data.

Using the following link, you can read the documentation of the yfinance library and it's API documentation: https://pypi.org/project/yfinance/

In [41]:
mag7_data = yf.download('MAGS', period='1y')
[*********************100%%**********************]  1 of 1 completed

We first check the count of rows in our dataframe.

In [42]:
print(mag7_data.count())
Open         252
High         252
Low          252
Close        252
Adj Close    252
Volume       252
dtype: int64

Below we clean our data by dropping NA values and duplicates, just in case to have clean data prepped for comparisons, testing, and analysis.

In [43]:
mag7_data = mag7_data.dropna()
mag7_data = mag7_data.drop_duplicates()

The plot below displays the adjusted close prices of the Magnificent Seven ETF over one year, illustrating its performance over time. We choose the Adjusted Close price to visualize becuase of its ability to account for splits, dividend distributions, and other corporate actions that can affect the stock. This metric provides a smooth, consistent view of the ETF's performance, allowing us to make easy comparisons and analysis of the investments returns. By focusing on adjusted close prices, the plot emphasizes the ETF's overall performance. Which captures both capital appreciation and dividend distributions within the one-year timeframe.

In [44]:
plt.figure(figsize=(14,5))
sns.set_style("ticks")
sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green')
sns.despine()
plt.title("Adjusted Close of Magnificent Seven ETF over time", size='medium', color='black')
Out[44]:
Text(0.5, 1.0, 'Adjusted Close of Magnificent Seven ETF over time')

Below we describe our dataframe, which provides a concise statistical summary of the Magnificent Seven ETF's historical market data. It includes essential metrics like count, mean, standard deviation, minimum, maximum, and quartiles for the adjusted close prices. This summary offers quick insights into the distribution and characteristics of the ETF's performance, aiding in analysis and decision-making for investment strategies.

In [45]:
mag7_data.describe()
Out[45]:
Open High Low Close Adj Close Volume
count 252.000000 252.000000 252.000000 252.000000 252.000000 2.520000e+02
mean 32.922198 33.137563 32.626996 32.883524 32.799326 8.497460e+04
std 3.693590 3.719676 3.598968 3.673566 3.726901 1.248657e+05
min 25.882000 26.129999 25.882000 26.030001 25.917496 5.000000e+02
25% 30.204999 30.288751 29.908249 30.172500 30.042091 4.375000e+03
50% 31.625000 31.835000 31.353500 31.604000 31.467403 3.140000e+04
75% 36.417500 36.697500 36.000000 36.429998 36.429998 1.306250e+05
max 40.450001 40.493999 40.029999 40.380001 40.380001 1.282400e+06

Below we download the stock data for TSM (Taiwan Semiconductor Manufacturing Company).

Downloading stock data for TSM is motivated with the intention to explore its relationship with Nvidia's and Apple's stock performance. This correlation is particularly important because Nvidia and Apple sources its Graphics Processing Units (GPUs) from TSM. By analyzing the stock data of the company, we aim to discover patterns, trends, and potential interactions between their respective stock prices. Understanding this relationship can offer us valuable insights for our prediction since certain weightage of the Magnificent Seven comes from Nvidia and Apple.

In addition to TSM, we downloaded data on Emerson Electric (EMR), which is Tesla's key supplier, and we aim to utilize that correlation as part of the model as well.

Lastly, we have downloaded Intel data due to their involvement in the semiconductor industry. Several Mag 7 companies can directly be influenced by Intel's movement and in our EDA, we can test this correlation.

In [46]:
tsm_data = yf.download('TSM', period='1y')
emr_data = yf.download('EMR', period='1y')
intel_data = yf.download('INTC', period='1y')
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed

We proceed to clean the rest of these stock dataframes before any analysis.

In [47]:
tsm_data = tsm_data.dropna()
tsm_data = tsm_data.drop_duplicates()

intel_data = intel_data.dropna()
intel_data = intel_data.drop_duplicates()

intel_data = intel_data.dropna()
intel_data = intel_data.drop_duplicates()

Exploratory Data Analysis¶

Now that we have data that is ready to be used for analysis, we can start exploring this data to find high level relationships between the features of each dataset, and find comparisons between the different datasets we have chosen.

Hypothesis Testing¶

First, we want to check how the MAGS stock compares to the TSM stock due to the relationship mentioned above between Apple/Nvidia and TSM. We can first start off by checking how different these two distributions are. We can accomplish this by using a two-sample T-test.

Using the following link, you can read the documentation of the scipy library and it's API documentation: https://docs.scipy.org/doc/scipy/

Our hypothesis are listed below:

Null Hypothesis: There is no significant difference between the average adjusted close prices of the MAGS and TSM stocks over the last year.

Alternative Hypothesis: The average adjusted close prices of the stocks MAGS and TSM significantly diverge from each other over the last year.

In [48]:
t_stat, p_value = scipy.stats.ttest_ind(mag7_data['Adj Close'], tsm_data['Adj Close'])
print("T-statistic:", t_stat)
print("P-Value: ", p_value)
T-statistic: -60.47918730129428
P-Value:  1.1760013176125386e-232
In [49]:
combined_data = pd.concat([mag7_data['Adj Close'], tsm_data['Adj Close']], axis=1)
combined_data.columns = ['MAGS', 'EMR']

plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")
sns.boxplot(data=combined_data)
plt.title('Comparison of Adjusted Close Prices: MAGS vs TSM')
plt.xlabel('Stock')
plt.ylabel('Adjusted Close Price')
plt.show()

We reject the null hypothesis as the p-value is much less than 0.05, which is our significance level (alpha). Let's visualize this difference in distribution for both these stocks. The hypothesis test simply shows that there is a large difference in the valuation of each stock. To fully gain more insight, we can move on to a correlation between the stocks to see if their movement is similar. If they are correlated in their movement, the TSM stock can be a great feature to use, as we can track it to gain certain insights into the value of the Magnificent Seven ETF.

Correlation Testing¶

Next, we can move on to see instead how the trends of these two stocks correlate over the past year. Let's first visualize the trends of each of the stocks individually.

In [50]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.figure(figsize=(14,5))
    sns.set_style("ticks")
    sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green', label='MAGS')
    sns.lineplot(data=tsm_data, x="Date", y='Adj Close', color='purple', label='TSM')
    plt.title('Adjusted Close Prices of MAGS and TSM: 05-2023 to 05-2024')
    plt.legend()
    plt.show()

By just looking at this graph, we cannot visually see much correlation between the two stocks. This may be because the adjusted close stock prices for TSM vary on a bigger range over the year than the MAGS adjusted close prices. Let's do a more in-depth analysis using Pearson's correlation coefficient to see the relationship between the adjusted close prices of both of the stocks.

In [51]:
mags_close = mag7_data['Adj Close']
tsm_close = tsm_data['Adj Close']

correlation = np.corrcoef(mags_close, tsm_close)[0, 1]
print("Pearson's correlation coefficient:", correlation)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=mags_close, y=tsm_close)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(mags_close, tsm_close)
x_values = np.linspace(min(mags_close), max(mags_close), 100)
y_values = slope * x_values + intercept
plt.plot(x_values, y_values, color='black', label=f'Linear Regression (R={correlation:.2f})')
plt.title('Scatter Plot of Adjusted Close Prices: MAGS vs TSM')
plt.xlabel('MAGS Adjusted Close Price')
plt.ylabel('TSM Adjusted Close Price')
plt.show()
Pearson's correlation coefficient: 0.9396902858140396

The high correlation coefficient obtained from a Pearson correlation test between the prices of Magnificent Seven and TSM (Taiwan Semiconductor Manufacturing Company) suggests a strong linear relationship between the two stocks. In the context of the Magnificent Seven movement, this high correlation could indicate several interconnected factors.

Firstly, it reflects common industry trends that heavily influence the semiconductor sector, such as technological advancements, shifts in demand for electronic devices, and broader economic conditions. Secondly, it may reflect the intricate supply chain dependencies within the semiconductor industry, given TSM's role as a significant manufacturer of chips for various companies, including Nvidia and Apple. Changes in TSM's performance or production capacity could significantly impact Nvidia's operations, as a result affecting investor sentiment for both firms. Additionally, the correlation could mirror shared market sentiment towards the semiconductor industry as a whole, where positive or negative news may affect both Nvidia and TSM's stock prices concurrently.

We can evaluate the relationship between the MAGS stock and Emerson Electric, Co. (EMR) in the same way. We start by graphing the adjusted close data for the past year of both of these stocks.

In [52]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.figure(figsize=(14,5))
    sns.set_style("ticks")
    sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green', label='MAGS')
    sns.lineplot(data=emr_data, x="Date", y='Adj Close', color='red', label='EMR')
    plt.title('Adjusted Close Prices of MAGS and EMR: 05-2023 to 05-2024')
    plt.legend()
    plt.show()

Looking at this graph, we can see a slight similarity in the trends of both of the stocks. We can further confirm this by finding the Pearson's correlation coefficient once more.

In [53]:
mags_close = mag7_data['Adj Close']
emr_close = emr_data['Adj Close']

correlation = np.corrcoef(mags_close, emr_close)[0, 1]
print("Pearson's correlation coefficient:", correlation)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=mags_close, y=emr_close)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(mags_close, emr_close)
x_values = np.linspace(min(mags_close), max(mags_close), 100)
y_values = slope * x_values + intercept
plt.plot(x_values, y_values, color='black', label=f'Linear Regression (R={correlation:.2f})')
plt.title('Scatter Plot of Adjusted Close Prices: MAGS vs EMR')
plt.xlabel('MAGS Adjusted Close Price')
plt.ylabel('EMR Adjusted Close Price')
plt.show()
Pearson's correlation coefficient: 0.8851207432962013

The high correlational coefficient between the Magnificent Seven ETF and EMR (Emerson Electric Co.), Tesla's supplier, indicates a strong statistical relationship between their performances. This suggests that changes in Mag 7's value are closely associated with corresponding changes in EMR's value.

This correlation could arise from many factors such as their business relationship, shared industry trends, and market changes. For us, understanding this correlation offers insights into predicting the Mag 7, with other industry knowledge that can be gained from the performance of EMR.

In [54]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.figure(figsize=(14,5))
    sns.set_style("ticks")
    sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green', label='MAGS')
    sns.lineplot(data=intel_data, x="Date", y='Adj Close', color='purple', label='EMR')
    plt.title('Adjusted Close Prices of MAGS and Intel: 05-2023 to 05-2024')
    plt.legend()
    plt.show()
In [55]:
mags_close = mag7_data['Adj Close']
intel_close = intel_data['Adj Close']

correlation = np.corrcoef(mags_close, intel_close)[0, 1]
print("Pearson's correlation coefficient:", correlation)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=mags_close, y=intel_close)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(mags_close, intel_close)
x_values = np.linspace(min(mags_close), max(mags_close), 100)
y_values = slope * x_values + intercept
plt.plot(x_values, y_values, color='black', label=f'Linear Regression (R={correlation:.2f})')
plt.title('Scatter Plot of Adjusted Close Prices: MAGS vs Intel')
plt.xlabel('MAGS Adjusted Close Price')
plt.ylabel('Intel Adjusted Close Price')
plt.show()
Pearson's correlation coefficient: 0.5282295025081082

The average correlational coefficient of 0.5 between the Magnificent Seven ETF (Mag 7) and Intel suggests a moderately positive relationship. While their stock prices tend to move somewhat together, the correlation isn't extremely strong. This insight implies that the stocks differ, as their price movements aren't perfectly synchronized. Overall, after this test we understand that Intel may not be an insightful feature to use while predicting the ETF.

Technical Indicators¶

We are now calculating technical indicators using the pandas_ta library. This library streamlines the process of calculating these indicators, giving us more time to concentrate on getting insights and creating visualizations. Technical analysis involves scrutinizing historical market data, such as price and volume, to forecast future price movements. Technical indicators are mathematical calculations derived from market data (in this case price of Magnificent Seven ETF), that are used to help traders and analysts in making informed trading decisions.

The indicators we're adding encompass a range of values that represent different parts of the stock and market. Moving Average Convergence/Divergence (MACD) highlights the relationship between two moving averages, indicating trend direction. Relative Strength Index (RSI) measures the speed and change of price movements which can be used in identifying overbought or oversold stocks. By calculating these indicators, we can understand the market dynamics, potentially spotting trading opportunities much better. These technical indicators, if informative enough, can be used as features in our stock prediction model.

Using the following link, you can read the documentation of the pandas_ta library and it's API documentation: https://github.com/twopirllc/pandas-ta

In [56]:
mag7_data.ta.macd(append=True) # Moving Average Convergence/Divergence
mag7_data.ta.rsi(append=True) # Relative Strength Index
Out[56]:
Date
2023-05-08          NaN
2023-05-09          NaN
2023-05-10          NaN
2023-05-11          NaN
2023-05-12          NaN
                ...    
2024-05-01    49.234011
2024-05-02    53.678298
2024-05-03    58.522552
2024-05-06    61.645198
2024-05-07    60.260892
Name: RSI_14, Length: 252, dtype: float64
RSI Indicator¶

Below we plot the RSI indicator to visualize the state of the ETF.

In [57]:
plt.plot(mag7_data['RSI_14'], label='RSI')
plt.axhline(y=70, color='r', linestyle='--', label='Overbought (70)')
plt.axhline(y=30, color='g', linestyle='--', label='Oversold (30)')
plt.legend()

plt.xticks(rotation=45, fontsize=8)
plt.title('RSI Indicator')
Out[57]:
Text(0.5, 1.0, 'RSI Indicator')

Here the Relative Strength Index (RSI) falls within the normal range, and that suggests a balanced market sentiment without any extreme bullishness or bearishness. This basically means that there is no extreme movement in either direction, a stable price. In such instances, the RSI value usually oscillates between 30 and 70. A normal RSI indicates a stable trend. It suggests that the market is in a state of equilibrium, with neither side exerting overwhelming pressure. The RSI is not as extreme and may not be useful in a prediction for this ETF.

MACD Indicator¶

Calculating the Moving Average Convergence Divergence (MACD) provides us valuable insights into the Magnificent Seven ETF's historical market performance. The MACD is computed by taking the difference between two Exponential Moving Averages (EMAs), usually it is a shorter-term EMA (12-day) and a longer-term EMA (26-day). This calculation highlights the momentum and trend changes in the ETF's price movements over time. Interpreting the MACD involves looking at and following its relationship with a signal line, which is usually a 9-day EMA of the MACD line itself. When the MACD line crosses above the signal line, it indicates a bullish crossover, suggesting a potential uptrend in the market and in the ETF's price. on the Other hand, when the MACD line crosses below the signal line, it signals a bearish crossover, indicating a downtrend.

Note: A moving average is a calculation used to smooth out the changes in data by creating a constantly updated average of recent historical prices or values over a specific time period. It helps to identify the trends by reducing noise or random changes in the data, making it easier to visualize the underlying direction of the trend. In finance, moving averages are commonly applied to stock prices to analyze price movements over time and identify potential entry or exit points in the market. Becuase the moving average is free of noise, it provides us a very insightful lagging numerical feature to add to our model.

In [58]:
plotly.offline.init_notebook_mode()

fig = go.Figure()

fig.add_trace(go.Scatter(x=mag7_data.index, y=mag7_data['MACD_12_26_9'], mode='lines', name='MACD'))
fig.add_trace(go.Scatter(x=mag7_data.index, y=mag7_data['MACDs_12_26_9'], mode='lines', name='Signal'))

fig.add_trace(go.Bar(x=mag7_data.index, y=mag7_data['MACDh_12_26_9'], name='MACD Histogram', marker_color=['green' if val >= 0 else 'red' for val in mag7_data['MACDh_12_26_9']]))

# Customize the chart
fig.update_xaxes(rangeslider=dict(visible=False))
fig.update_layout(plot_bgcolor='#efefff', font_family='Monospace', font_color='#000000', font_size=20,width=1000)
fig.update_layout(title="MACD chart for Magnificent Seven ETF")
fig.show()

In this analysis of the MACD, we observe a bullish crossover between the MACD line and the signal line. This means that it is a sign of a uptrend in the market. This confirms a upwards trend in the ETF's future price movement. In a situation like this we are presented with an opportunity to invest and grow profit as we may consider entering or holding positions in the ETF with hopes of further price growth. By examining the historical MACD plot, we can gain valuable insights into past trend changes and momentum shifts of the fund. Overall, using this MACD analysis enhances our understanding of the Magnificent Seven ETF's market changes.

Primary Analysis¶

Now that we have insight on the important features and correlations in our data, we can start our primary analysis. From our previous analysis, we have concluded that we want to use the adjusted closing prices of Emerson Electric Co. (EMR), Taiwan Semiconductor Manufacturing Co. (TSM), and Intel (INTC), as well as the MACD of the Mag 7 stock, to predict the adjusted closing stock price of the Mag 7 stock itself. To do this we have decided to use a LSTM model.

Long Short-Term Memory Models (LSTMs) are used in stock analysis often. One of the main strengths of these models is its ability to capture and learn from long-term dependencies in sequential data. This is useful for our application as the model makes use of year-long stock data, and will need to capture patterns and trends that may not be initialliy obvious throughout the data, and learn from them in order to sufficiently predict the future stock price. These types of models are particularly useful for time-series data as well.

Let's start by compiling the features we want to use in our model into one dataframe.

In [59]:
model_data = pd.DataFrame({'Adj Close': mag7_data['Adj Close'],
                  'intel close': intel_data['Adj Close'], 
                  'tsm close': tsm_data['Adj Close'], 
                  'emr close': emr_data['Adj Close'], 
                  'mag7': mag7_data['MACD_12_26_9']})

model_data = model_data.dropna()

The above code puts all the relevant features, which are the Intel adjusted closing price, the TSM adjusted closing price, the EMR adjusted closing price, and the MACD for Mag7 in the model data. It also includes the label, which is the MAGS adjusted closing price, which is what we are trying to predict, as the first column.

We then move on to splitting the data into our training and testing data. We use 1 month of the year-long data as our test data, while using the rest as our training data. We also scale all the data to the range between 0 and 1. This is done to make sure all the features are weighed the same by the model, as to not give importance to any feature over the others.

In [60]:
test_start_date = model_data.index[-1] - pd.DateOffset(months=1)
train_data = model_data.loc[model_data.index < test_start_date]
test_data = model_data.loc[model_data.index >= test_start_date]

# Scale the data for model
scaler = MinMaxScaler(feature_range=(0, 1))
train_data_scaled = scaler.fit_transform(train_data)
test_data_scaled = scaler.transform(test_data)

n_steps = 4
n_features = len(model_data.columns)

# Creating Sequences
def create_sequences(data, n_steps):
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])
        y.append(data[i + n_steps, 0])  # Assuming the target is in the first column
    return np.array(X), np.array(y)

X_train, y_train = create_sequences(train_data_scaled, n_steps)
X_test, y_test = create_sequences(test_data_scaled, n_steps)
X_train = X_train.reshape((X_train.shape[0], n_steps, n_features))
X_test = X_test.reshape((X_test.shape[0], n_steps, n_features))

Above, we split the data into testing and training data, scaled the data, and also split the label from the features, formatting the features accordingly.

Now we can get into building the model.

In [61]:
# Creating and compilation of the LSTM model
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(50, return_sequences=True))
model.add(LSTM(50))
model.add(Dense(1))  # Output layer with 1 neuron for regression
model.compile(optimizer='adam', loss='mean_squared_error')
c:\Users\dhruv\anaconda3\envs\320_final\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning:

Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.

We have created a Sequential model with an input layer, hidden LSTM layers, and 1 output Dense layer. The first layer takes in the inputs of shape (n_steps, n_features), and each subsequent layer has 50 neurons. The final output layer compiles the values from the previous layers into one final output. We use Adam as our optimizer, and we use Mean Squared Error to compute the loss for each training epoch.

After this, we must train our model.

In [62]:
# Training
early_stopping = EarlyStopping(monitor='val_loss', patience=7)
history = model.fit(X_train, y_train, epochs=90, batch_size=64, verbose=2, validation_data=(X_test[7:], y_test[7:]))
epoch_loss = history.history['loss']
epoch_val_loss = history.history['val_loss']
Epoch 1/90
4/4 - 4s - 897ms/step - loss: 0.2199 - val_loss: 0.6842
Epoch 2/90
4/4 - 0s - 14ms/step - loss: 0.1527 - val_loss: 0.5007
Epoch 3/90
4/4 - 0s - 13ms/step - loss: 0.0790 - val_loss: 0.2749
Epoch 4/90
4/4 - 0s - 12ms/step - loss: 0.0270 - val_loss: 0.0811
Epoch 5/90
4/4 - 0s - 12ms/step - loss: 0.0363 - val_loss: 0.0630
Epoch 6/90
4/4 - 0s - 12ms/step - loss: 0.0277 - val_loss: 0.1281
Epoch 7/90
4/4 - 0s - 12ms/step - loss: 0.0173 - val_loss: 0.1770
Epoch 8/90
4/4 - 0s - 12ms/step - loss: 0.0192 - val_loss: 0.1638
Epoch 9/90
4/4 - 0s - 13ms/step - loss: 0.0151 - val_loss: 0.1097
Epoch 10/90
4/4 - 0s - 12ms/step - loss: 0.0093 - val_loss: 0.0672
Epoch 11/90
4/4 - 0s - 12ms/step - loss: 0.0088 - val_loss: 0.0521
Epoch 12/90
4/4 - 0s - 12ms/step - loss: 0.0066 - val_loss: 0.0689
Epoch 13/90
4/4 - 0s - 13ms/step - loss: 0.0052 - val_loss: 0.0718
Epoch 14/90
4/4 - 0s - 12ms/step - loss: 0.0048 - val_loss: 0.0508
Epoch 15/90
4/4 - 0s - 13ms/step - loss: 0.0041 - val_loss: 0.0363
Epoch 16/90
4/4 - 0s - 13ms/step - loss: 0.0044 - val_loss: 0.0403
Epoch 17/90
4/4 - 0s - 12ms/step - loss: 0.0041 - val_loss: 0.0478
Epoch 18/90
4/4 - 0s - 13ms/step - loss: 0.0041 - val_loss: 0.0491
Epoch 19/90
4/4 - 0s - 12ms/step - loss: 0.0040 - val_loss: 0.0464
Epoch 20/90
4/4 - 0s - 12ms/step - loss: 0.0039 - val_loss: 0.0430
Epoch 21/90
4/4 - 0s - 12ms/step - loss: 0.0039 - val_loss: 0.0474
Epoch 22/90
4/4 - 0s - 12ms/step - loss: 0.0038 - val_loss: 0.0535
Epoch 23/90
4/4 - 0s - 13ms/step - loss: 0.0038 - val_loss: 0.0526
Epoch 24/90
4/4 - 0s - 13ms/step - loss: 0.0037 - val_loss: 0.0477
Epoch 25/90
4/4 - 0s - 12ms/step - loss: 0.0037 - val_loss: 0.0470
Epoch 26/90
4/4 - 0s - 13ms/step - loss: 0.0036 - val_loss: 0.0521
Epoch 27/90
4/4 - 0s - 13ms/step - loss: 0.0037 - val_loss: 0.0515
Epoch 28/90
4/4 - 0s - 12ms/step - loss: 0.0036 - val_loss: 0.0447
Epoch 29/90
4/4 - 0s - 13ms/step - loss: 0.0036 - val_loss: 0.0392
Epoch 30/90
4/4 - 0s - 12ms/step - loss: 0.0036 - val_loss: 0.0462
Epoch 31/90
4/4 - 0s - 13ms/step - loss: 0.0035 - val_loss: 0.0482
Epoch 32/90
4/4 - 0s - 12ms/step - loss: 0.0035 - val_loss: 0.0439
Epoch 33/90
4/4 - 0s - 13ms/step - loss: 0.0035 - val_loss: 0.0452
Epoch 34/90
4/4 - 0s - 11ms/step - loss: 0.0036 - val_loss: 0.0516
Epoch 35/90
4/4 - 0s - 12ms/step - loss: 0.0036 - val_loss: 0.0358
Epoch 36/90
4/4 - 0s - 13ms/step - loss: 0.0034 - val_loss: 0.0314
Epoch 37/90
4/4 - 0s - 13ms/step - loss: 0.0034 - val_loss: 0.0405
Epoch 38/90
4/4 - 0s - 12ms/step - loss: 0.0034 - val_loss: 0.0368
Epoch 39/90
4/4 - 0s - 13ms/step - loss: 0.0033 - val_loss: 0.0286
Epoch 40/90
4/4 - 0s - 13ms/step - loss: 0.0033 - val_loss: 0.0393
Epoch 41/90
4/4 - 0s - 12ms/step - loss: 0.0038 - val_loss: 0.0425
Epoch 42/90
4/4 - 0s - 12ms/step - loss: 0.0033 - val_loss: 0.0208
Epoch 43/90
4/4 - 0s - 13ms/step - loss: 0.0040 - val_loss: 0.0245
Epoch 44/90
4/4 - 0s - 13ms/step - loss: 0.0032 - val_loss: 0.0391
Epoch 45/90
4/4 - 0s - 12ms/step - loss: 0.0035 - val_loss: 0.0285
Epoch 46/90
4/4 - 0s - 13ms/step - loss: 0.0031 - val_loss: 0.0248
Epoch 47/90
4/4 - 0s - 12ms/step - loss: 0.0030 - val_loss: 0.0290
Epoch 48/90
4/4 - 0s - 12ms/step - loss: 0.0032 - val_loss: 0.0284
Epoch 49/90
4/4 - 0s - 12ms/step - loss: 0.0030 - val_loss: 0.0206
Epoch 50/90
4/4 - 0s - 12ms/step - loss: 0.0029 - val_loss: 0.0219
Epoch 51/90
4/4 - 0s - 12ms/step - loss: 0.0029 - val_loss: 0.0196
Epoch 52/90
4/4 - 0s - 13ms/step - loss: 0.0028 - val_loss: 0.0197
Epoch 53/90
4/4 - 0s - 13ms/step - loss: 0.0028 - val_loss: 0.0167
Epoch 54/90
4/4 - 0s - 13ms/step - loss: 0.0029 - val_loss: 0.0146
Epoch 55/90
4/4 - 0s - 12ms/step - loss: 0.0028 - val_loss: 0.0187
Epoch 56/90
4/4 - 0s - 12ms/step - loss: 0.0028 - val_loss: 0.0158
Epoch 57/90
4/4 - 0s - 12ms/step - loss: 0.0029 - val_loss: 0.0153
Epoch 58/90
4/4 - 0s - 12ms/step - loss: 0.0029 - val_loss: 0.0102
Epoch 59/90
4/4 - 0s - 13ms/step - loss: 0.0028 - val_loss: 0.0184
Epoch 60/90
4/4 - 0s - 13ms/step - loss: 0.0032 - val_loss: 0.0147
Epoch 61/90
4/4 - 0s - 13ms/step - loss: 0.0026 - val_loss: 0.0065
Epoch 62/90
4/4 - 0s - 12ms/step - loss: 0.0031 - val_loss: 0.0120
Epoch 63/90
4/4 - 0s - 13ms/step - loss: 0.0027 - val_loss: 0.0115
Epoch 64/90
4/4 - 0s - 12ms/step - loss: 0.0025 - val_loss: 0.0063
Epoch 65/90
4/4 - 0s - 12ms/step - loss: 0.0032 - val_loss: 0.0069
Epoch 66/90
4/4 - 0s - 13ms/step - loss: 0.0027 - val_loss: 0.0100
Epoch 67/90
4/4 - 0s - 13ms/step - loss: 0.0025 - val_loss: 0.0100
Epoch 68/90
4/4 - 0s - 13ms/step - loss: 0.0025 - val_loss: 0.0120
Epoch 69/90
4/4 - 0s - 12ms/step - loss: 0.0025 - val_loss: 0.0108
Epoch 70/90
4/4 - 0s - 12ms/step - loss: 0.0025 - val_loss: 0.0105
Epoch 71/90
4/4 - 0s - 12ms/step - loss: 0.0025 - val_loss: 0.0086
Epoch 72/90
4/4 - 0s - 13ms/step - loss: 0.0024 - val_loss: 0.0114
Epoch 73/90
4/4 - 0s - 12ms/step - loss: 0.0026 - val_loss: 0.0067
Epoch 74/90
4/4 - 0s - 12ms/step - loss: 0.0029 - val_loss: 0.0051
Epoch 75/90
4/4 - 0s - 13ms/step - loss: 0.0024 - val_loss: 0.0113
Epoch 76/90
4/4 - 0s - 13ms/step - loss: 0.0028 - val_loss: 0.0066
Epoch 77/90
4/4 - 0s - 13ms/step - loss: 0.0024 - val_loss: 0.0053
Epoch 78/90
4/4 - 0s - 13ms/step - loss: 0.0024 - val_loss: 0.0078
Epoch 79/90
4/4 - 0s - 12ms/step - loss: 0.0027 - val_loss: 0.0053
Epoch 80/90
4/4 - 0s - 13ms/step - loss: 0.0026 - val_loss: 0.0037
Epoch 81/90
4/4 - 0s - 13ms/step - loss: 0.0025 - val_loss: 0.0090
Epoch 82/90
4/4 - 0s - 12ms/step - loss: 0.0026 - val_loss: 0.0062
Epoch 83/90
4/4 - 0s - 13ms/step - loss: 0.0026 - val_loss: 0.0054
Epoch 84/90
4/4 - 0s - 13ms/step - loss: 0.0025 - val_loss: 0.0079
Epoch 85/90
4/4 - 0s - 13ms/step - loss: 0.0024 - val_loss: 0.0038
Epoch 86/90
4/4 - 0s - 13ms/step - loss: 0.0025 - val_loss: 0.0049
Epoch 87/90
4/4 - 0s - 12ms/step - loss: 0.0024 - val_loss: 0.0043
Epoch 88/90
4/4 - 0s - 13ms/step - loss: 0.0023 - val_loss: 0.0047
Epoch 89/90
4/4 - 0s - 13ms/step - loss: 0.0023 - val_loss: 0.0049
Epoch 90/90
4/4 - 0s - 14ms/step - loss: 0.0023 - val_loss: 0.0056

We train the model for 90 epochs, passing in 64 batches of input data at a time. We use some of the test data as validation data to ensure the model is working correctly. We then calculate the training loss and validation loss over every epoch.

Now our model is ready to be used! We try to predict the outputs of the test data as follows.

In [63]:
# Predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
7/7 ━━━━━━━━━━━━━━━━━━━━ 1s 48ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step

We predict the outputs using our trained model on both the training data and the testing data, and save the outputs in the train_predict and test_predict variables respectively. Now we can use these predictions and compare them to the ground truth values to see how well our model did.

We have one more step, which is to convert all our data back to its original format by undoing the scaling and reshaping the data.

In [64]:
# Convertion of data back into original scale to view
zeros_array = np.zeros((train_predict.shape[0], n_features-1))
train_predict_combined = np.hstack((train_predict, zeros_array))
train_predict = scaler.inverse_transform(train_predict_combined)
train_predict = train_predict[:, 0]
zeros_array = np.zeros((test_predict.shape[0], n_features-1))
test_predict_combined = np.hstack((test_predict, zeros_array))
test_predict = scaler.inverse_transform(test_predict_combined)
test_predict = test_predict[:, 0]
zeros_array_train = np.zeros((y_train.shape[0], n_features-1))
zeros_array_test = np.zeros((y_test.shape[0], n_features-1))
y_train_combined = np.hstack((y_train.reshape(-1, 1), zeros_array_train))
y_test_combined = np.hstack((y_test.reshape(-1, 1), zeros_array_test))
y_train_combined = scaler.inverse_transform(y_train_combined)
y_test_combined = scaler.inverse_transform(y_test_combined)
y_train = y_train_combined[:, 0]
y_test = y_test_combined[:, 0]

date_index = test_data.index[n_steps:]
predictions_df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': test_predict.flatten()}, index=date_index)

Now we have all of our data formatted correctly and our predictions stored in the predictions_df dataframe. Our next step is to visualize our findings.

Visualization¶

Let's start by comparing our predicted adjusted closing prices with the actual ground truth prices. This is shown in the graph below.

In [65]:
plt.figure(figsize=(12, 6))
plt.plot(predictions_df.index, predictions_df['Actual'], label='Actual Prices', color='b')
plt.plot(predictions_df.index, predictions_df['Predicted'], label='Predicted Prices', color='r')
plt.title('Actual vs. Predicted Stock Prices by Date')
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend()
plt.grid(True)
plt.show()

We can see that the predictions follow a similar trend as the actual values, although this is not as close as we would hope for. This is because of the limitations of the LSTM model as well as other factors, discussed below in the insights section.

We can also visualize our loss during training and validation over each training epoch, as shown below.

In [66]:
plt.plot(range(1,91),epoch_loss,label='Training Loss')
plt.xlabel("Epoch")
plt.ylabel('Training Loss')
plt.title('Training Loss over Epochs')
plt.legend()
plt.show()

plt.plot(range(1,91),epoch_val_loss, label='Validation Loss', color='r')
plt.xlabel("Epoch")
plt.ylabel('Validation Loss')
plt.title('Validation Loss over Epochs')
plt.legend()
plt.show()

We can see a clear downward trend in both of these graphs, indicating that the model was learning throughout the process and its predictions were getting better. The loss indicates how far the predicted values deviate from the actual ground truth values. The training loss is the loss calculated using the training data, whereas the validation loss shows the loss when the partially trained model is tested on a small portion of the data, called validation data.

Now that we have visualized our predictions and the loss, we can make certain insights on what all of this means, and how it is relevant to our project.

Insights and Conclusions¶

Through the creation of our model, we found that using the close values of companies directly involved in the creation of Magnificent 7 products gave us a valid predictor for the close value of the stock. Although our model is rudimentary and isn't a strong predictor for the stock's value, it generally follows the up-turns and down-turns of the stock in such a way that using this as a predictor is stronger than the mean. Through this project we learnt that perfectly trying to capture and predict the values of a stock requires a level of analysis and information much larger than that of our simple model. There were, additionally, a number of limitations to our progress that affected our ability to create a stronger predictor. For example, the fact that our moving average feature was only limited to the past year greatly inhibited the amount of training data we could build on. Additionally, the model we used (an LSTM) is highly susceptible to memorizing noise in a dataset, and as we only used a few features it is very likely that our model overfitted to detrement of our accuracy.

If we conducted further analysis, we would look for other indicators involved in the price fluctuation of Magnificent 7 stock - through exploring general market trends and online sentiment analysis- and we would invest into the stock using our model to test if extended use is indeed profitable. In all though, we utilized each part of the data science pipeline, from cleaning to feature engineering to machine learning analysis, to create a functional Recurrent Neural Network that gives investors an edge in Magnificent 7 investment.